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Investigation of species abundance has become a vital compo¬ 
nent of many ecological monitoring studies. The primary objective 
of these studies is to understand how specific species are distributed 
across the study domain, as well as quantification of the sampling 
efficiency for detecting these species. To achieve these goals, pres¬ 
elected locations are sampled during scheduled visits, in which the 
number of species observed at each location is recorded. This re¬ 
sults in spatially referenced replicated count data that are often un¬ 
balanced in structure and exhibit overdispersion. Motivated by the 
Baltimore Ecosystem Study, we propose Bayesian hierarchical bino¬ 
mial mixture models, including Binomial Conway-Maxwell Poisson 
(Bin-CMP) mixture models, that formally account for varying lev¬ 
els of spatial dispersion. Our proposed models also allow for variable 
selection of model covariates and grouping of dispersion parameters 
through the implementation of reversible jump Markov chain Monte 
Carlo methodology. Finally, using demographic covariates from the 
American Community Survey, we demonstrate the effectiveness of our 
approach through estimation of abundance for the American Robin 
(Turdus migratorius) in the Baltimore Ecosystem Study. 


1. Introduction. Investigation of species abundance is a topic of 
widespread interest in ecology. To estimate and model variation in species 
abundance, predetermined survey points are visited at each sampling oc¬ 
casion and the number of animals detected are recorded. This results in 
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spatially referenced point count data. Such a sampling protocol is easier 
to implement than the traditional capture-recapture experiment [e.g., see 
Williams, Nichols and Conroy (2002) and the references therein], since each 
animal encountered does not have to be distinctly tagged. Nevertheless, 
these spatially referenced data can be utilized to estimate the abundance of 
animals, for which individual tagging might be difficult or even infeasible 
due to the amount of effort involved, for example, in some avian ecology 
surveys. Therefore, to estimate abundance, the development of binomial 
mixture models has drawn significant attention over the past few decades 
[e.g., Carroll and Lombard (1985), Royle (2004), K&y, Royle and Schmid 
(2005), Kery (2008), Webster, Pollock and Simons (2008)]. 

In developing statistical models for count data, the choice of the distribu¬ 
tion function frequently depends on the dispersion associated with the data. 
For equidispersed data (i.e., equal mean and variance), the Poisson distri¬ 
bution is frequently used due to its explicit assumption of equidispersion. 
However, to model overdsipersed data (i.e., the variance is greater than the 
mean), the choice of distribution functions needs to be made [e.g., see Ver 
Hoef and Boveng (2007)]. Often, the negative binomial (NB) distribution 
[Cameron and Trivedi (1998)] is employed, due to a dispersion parame¬ 
ter that conveniently controls the level of overdispersion. Alternatively, the 
Poisson distribution can also be used with a random effect included to relax 
the restrictive assumption of equidispersion. Although the Poisson and NB 
distributions have become the de facto options for count data, neither of 
them accounts for underdispersion (i.e., the variance is less than the mean). 
Admittedly, overdispersion is more common for data arising from ecological 
monitoring studies, while underdispersion is often present for rare event data 
[e.g., Berbers (1989), Ridout and Besbeas (2004), Oh, Washington and Nam 
(2006)]. Nevertheless, cases can arise in ecological monitoring studies where 
the species of interest is less prevalent (due to being rare occurrences). In 
principle, these situations would manifest themselves as underdispersion. 

The Conway-Maxwell Poisson (CMP) distribution [Conway and Maxwell 
(1962)] is an ideal candidate for modeling count data with different types of 
dispersion, as it has an extra dispersion parameter that flexibly allows for 
equi-, over-, and underdispersion. Moreover, the CMP distribution is closely 
related to many other discrete distributions. For example, the CMP distri¬ 
bution contains the Poisson distribution as a special case and generalizes 
Bernoulli and geometric distributions in the limiting cases [Shmueli et al. 
(2005)]. Owing to its versatility, the CMP distribution has become increas¬ 
ingly popular among many subject-matter disciplines. For example, in the 
context of breeding bird surveys, Wu, Holan and Wikle (2013) develop a 
Bayesian hierarchical spatio-temporal CMP model for complex and high¬ 
dimensional count data. A unique aspect of this research is that it allows 
for dynamic spatial dispersion (i.e., the dispersion over the spatial domain 
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evolves over time). A comprehensive overview regarding the CMP model is 
provided by Sellers, Borle and Shmueli (2012) and the references therein. 

Binomial mixture models have become increasingly popular for analyzing 
spatial point referenced count data in the context of estimating and mod¬ 
eling variation in species abundance. As a result, various models have been 
developed with this application in mind. For example, Carroll and Lombard 
(1985) consider a Binomial-Beta mixture model to study the problem of esti¬ 
mating an unknown population, N, that follows a discrete uniform distribu¬ 
tion, in which efficient estimators were obtained through the use of an inte¬ 
grated likelihood method. To improve the estimator proposed by Carroll and 
Lombard (1985), Royle (2004) develops a Binomial-Poisson (Bin-Pois) mix¬ 
ture model, in which N is considered to be an independent random variable 
from a Poisson distribution. Subsequently, Royle and Dorazio (2006) propose 
a more general hierarchical modeling framework with the goal of addressing 
animal abundance in the case of imperfect detection, wherein the variation 
associated with the observed data was partitioned into that of abundance 
and that of detectability. In the context of avian ecology studies, Kery, Royle 
and Schmid (2005) and Kery (2008) apply the Bin-Pois models to the es¬ 
timation of bird abundance. Webster, Pollock and Simons (2008) propose 
a Bin-Pois model, in which a conditional autoregressive (CAR) model was 
used to address spatial dependence found in the bird density. Wenger and 
Freeman (2008) develop zero-inflated Bin-Pois and zero-inflated Binomial¬ 
negative binomial (Bin-NB) models for the estimation of species abundance. 
Kery and Royle (2010) develop a Bin-Pois model with a site-specific random 
effect to allow for overdispersion and, thus, the equidispersion assumption of 
the Poisson distribution is relaxed. Graves et al. (2011) apply the Bin-Pois 
model to estimate abundance for a grizzly bear population using multiple 
detection methods, in which covariates are introduced to explain variation 
in both the detection and intensity process. Under the frequentist frame¬ 
work, Bail and Madsen (2011) propose a general Bin-Pois model to allow 
for a formal statistical test regarding the assumption of population closure. 
However, none of the aforementioned models simultaneously allows for data 
with different levels of dispersion (over- and underdispersion) and Bayesian 
model selection (e.g., using the Conway-Maxwell Poisson distribution and 
reversible jump Markov chain Monte Carlo). 

Some experiments in ecological studies can be viewed as a robust design 
[e.g., see Pollock (1982)], that is, there are secondary, and possibly subse¬ 
quent, sampling periods nested within each primary sampling occasion. For 
example, the American Robin (Turdus migratorius) data we consider from 
the Baltimore Ecosystem Study (BES) falls into this category. This nested 
sampling design contains the design with one primary sampling occasion as a 
special case. Motivated by American Robin data from BES (Section 6), we 
develop a Binomial Conway-Maxwell Poisson (Bin-CMP) mixture model 
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that accommodates both overdispersed and underdispersed data under a 
nested/unbalanced data structure. The Bin-CMP models we propose are 
cast in a general Bayesian hierarchical binomial mixture model framework 
that can accommodate mixtures using distributions other than the CMP. 

Compared with the existing models in the literature, our contribution 
can be seen as follows. First, we develop a flexible class of binomial mixture 
models to account for replicated count data with different types of disper¬ 
sion, which is achieved by choosing a suitable model for the abundance 
parameter (e.g., using the CMP distribution). In the case of overdispersed 
data, our methodology is advantageous from an estimation perspective when 
compared to the general modeling strategy that includes a random effect to 
account for extra dispersion [e.g., see Kery and Royle (2010)], as our model 
has a fewer number of parameters to be estimated. Although each parameter 
may be more computationally expensive, compared to the strategy of includ¬ 
ing a random effect, this computational burden can be alleviated through 
the use of a lower level programing language and parallel computation. More 
importantly, our model provides an explicit quantification of dispersion and 
can also be used in the context of underdispersed data. Additionally, the 
models we consider can flexibly account for spatial dependence in species 
abundance by adding a low-rank spatial component to the model for the 
intensity process. In contrast to the CAR models used by Webster, Pol¬ 
lock and Simons (2008), our methodology does not require us to define a 
neighborhood structure for the point count data, which can be difficult in 
many cases. In the setting of our motivating example, where the bird counts 
themselves are modeled at the point level rather than on areal units, a geo- 
statistical approach may be more appropriate. Further, through reversible 
jump Markov chain Monte Carlo (RJMCMC), we introduce automated vari¬ 
able selection for covariates and grouping of dispersion parameters into the 
binomial mixture modeling framework and, to the best of our knowledge, 
our approach constitutes the first successful RJMCMC implemented on the 
CMP dispersion parameters. Last, the variable selection allows us to iden¬ 
tify important predictors related to high detectability and abundance for a 
given species of interest. 

This paper is organized as follows. Section 2 introduces our motivating 
data from the BES and provides preliminary background information on 
the CMP distribution. Section 3 describes our proposed Bayesian hierar¬ 
chical binomial mixture models, including the Bin-CMP model. Section 4 
provides relevant information on Bayesian variable selection and grouping 
using RJMCMC. Simulated examples are presented in Section 5, illustrating 
the effectiveness of our modeling approach. Section 6 contains an analysis of 
our motivating data, estimating abundance of the American Robin from the 
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BES, and demonstrates the utility of our methodology. Discussion is pro¬ 
vided in Section 7. For convenience of exposition, specihc details surround¬ 
ing our Markov chain Monte Carlo (MCMC) algorithm and full conditional 
distributions are left to a supplemental article [Wu et al. (2015)]. 

2. Data and preliminary backgronnd. 

2.1. Baltimore Ecosystem Study survey data. As a long-term ecological 
monitoring study, the BES considers the City of Baltimore, Maryland as 
a study area, with the objective of understanding how the City of Balti¬ 
more evolves as an ecosystem over time [Pickett et al. (2011)]. Collected 
as a part of the BES, the American Robin ( Turdus migratorius) data we 
consider constitutes spatially replicated point count data on 132 bird census 
points in the City of Baltimore, which are randomly selected from a set of 
urban forest effect (UFORE or I-Tree Eco) model points (Section 6). Consid¬ 
ered as the most widespread North American thrush, the American Robin 
has become common in many North American cities [Sallabanks and James 
(1999)]. Despite its abundance, conservation measures, which are enforced 
by the Migratory Bird Treaty Act of 2004, have been taken to protect the 
American Robin throughout its geographical range in the United States. Al¬ 
though BES data have been collected across bird survey points since 2005, 
as an illustration, we consider a subset of data over five years from 2005 
to 2009, due to incomplete data in later years. In each year, three surveys 
were scheduled for each of the survey points throughout May and August, 
each of which consisted of a five minute survey conducted between 5 am 
and 10 am on days without rain. During each survey, the recorded count 
represents the combination of birds that were seen, heard, or flew over each 
survey point. In the current context, the secondary sampling period con¬ 
sists of the five minute daily survey, while the primary sampling periods are 
the time frames determined by the dates on which three daily surveys are 
conducted. As a result, the nested sampling design provides a maximum of 
15 spatially referenced counts for each bird census point. Despite the fact 
that several species are available in the BES, as an illustration, we consider 
American Robin counts in our analysis, due to their higher abundance rel¬ 
ative to other species. Among the 132 bird census points, 131 of them have 
American Robin detections (Figure 1). 

2.2. The Conway-Maxwell Poisson distribution. Let X denote a CMP 
distributed random variable, that is, X ~ CMP(A, i^), where A > 0 and > 0 
are the CMP intensity and dispersion parameters, respectively. The proba¬ 
bility mass function (pmf) of X is given by 

A"^ 1 


( 1 ) 


P{X = x) 


ixirZiX,n) 


x = 0,l,2,.... 
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Fig. 1. Plot of 131 bird census points for American Robin in the City of Baltimore, 
Maryland (using R package “RgoogleMaps”). The solid circles are bird census points. 

where 

P) Z(A,.) = X: — 

j=0 

is a normalizing constant (often referred to as the “Z-function”). With the 
additional parameter za, the CMP distribution conveniently accommodates 
equidispersion, overdispersion, and underdispersion. Specifically, v = 1 cor¬ 
responds to the Poisson distribution, whereas v < 1 and > 1 represent 
overdispersion and underdispersion, respectively. In addition, the CMP dis¬ 
tribution generalizes to the geometric and Bernoulli distributions in the 
limiting cases [Shmueli et al. (2005)]. 

For the calculation of (1), the Z-function needs to be computed numer¬ 
ically due to the summation of an infinite series. For certain combinations 
of A and v, many terms will be needed in order to truncate the inhnite 
summation with sufficient accuracy, which leads to intensive computation. 
For these cases, Minka et al. (2003) derived an asymptotic approximation 
to the Z-function, which is accurate when A > 10^. Wu, Holan and Wikle 
(2013) discuss further improvements on computation by taking advantage 
of parallel computing through Open Multiprocessing (OpenMP) and Com¬ 
pute Unified Device Architecture (CUBA), that is, graphics processing unit 
(GPU). 
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3. Hierarchical Binomial mixture models. 

3.1. Model development. Let G H C denote a set of sam¬ 

pling locations. We consider an experimental design in which animals are 
surveyed at each sampling location Sj for a total of J primary sampling oc¬ 
casions, in which there are potentially K nested secondary sampling periods. 
In principle, the primary sampling occasions can be over any arbitrary time 
interval, for example, in weeks or months. In addition, we assume a closed 
population within each primary sampling occasion so that the species abun¬ 
dance at each location varies across primary sampling occasions but not 
within. Relative to the primary sampling occasion, the secondary sampling 
period might be over a shorter time interval, for example, daily surveys 
within the three-month long primary sampling occasions. To allow for an 
unbalanced data structure, due to missing observations, we assume Uij < K 
successful visits to site Sj during the jth primary sampling period with the 
number of animals detected recorded. Therefore, it follows that 0 < Uij < K, 
i = 1,2,..., G; j = 1,2,..., J. We note that “missing” values are not uncom¬ 
mon and can occur for many reasons. For example, some scheduled visits 
might not be made due to illness of the observer, and as a result no data 
will be recorded. In the current context, we assume that any missing data 
are missing completely at random (MCAR) [Little and Rubin (2002)]. 

For i = 1,2,..., G, j = 1,2,..., J, and k = 1,2,... ,nij, let yijk be the num¬ 
ber of animals observed at location Sj during the A:th secondary sampling 
within the jth primary sampling occasion. The observed data can be denoted 
by Y = {yij G = 1,2,..., G; j = 1,2,..., J}, where yij = {yiji,yij 2 , • • •, yijmj 
and 1 < Hij < K. Note that Ujj = 0 corresponds to the case that no success¬ 
ful visits are made to site i and, thus the vector yij does not have any 
elements. Further, let pijk be the probability of detecting an animal during 
the kth [k = 1 , 2 ,..., Uij) secondary sampling within the jth primary sam¬ 
pling occasion {j = 1,2,..., J) at location Sj and denote Nij as the unknown 
animal abundance at location Sj during the jth primary sampling occasion. 
In other words, Wj represents the total number of animals available for sam¬ 
pling during the jth primary sampling occasion at location Sj. Due to the 
closed population assumption, does not vary among secondary sampling 
periods within each primary sampling occasion. 

The nested design we consider is more general than many of the designs 
previously investigated [e.g., Royle (2004), Kery, Royle and Schmid (2005), 
Royle and Link (2005), Royle and Dorazio (2006), Kery (2008), Webster, 
Pollock and Simons (2008)], all of which can be seen as a special case of 
ours by setting K = 1. In contrast, our study design is more similar to those 
found in Chandler, Royle and King (2011) and Bail and Madsen (2011). 
Additionally, for the sake of flexibility, it is not necessary that = K (for 
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all i = 1,2,..., G and j = 1,2,..., J). Importantly, the replicated data col¬ 
lected in the secondary sampling provides additional information that could 
alleviate potential issues caused by missing values as well as improve the 
accuracy of parameter estimation over the nonnested design. The primary 
objective of our analysis is to estimate abundance and draw inference about 
detectability. To achieve these goals, we propose a class of hierarchical bi¬ 
nomial mixture models, that includes the Bin-CMP model. 

The class of binomial mixture models naturally fits into the hierarchical 
framework [e.g., Royle and Dorazio (2008), Cressie and Wikle (2011)]. In 
this framework, we define the observation model as 

(3) yijk\d^ij iPijk ^ )Pijfc) j 

for i = 1,2,..., G; j = 1,2,..., J; A; = 1,2,..., njj, where the probability pijk 
corresponds to the kth. secondary sampling within the jth primary sampling 
occasion at location Sj. For the design we consider, (3) allows us to estimate 
abundance parameters Nij, which are both location- and time-specific. Also, 
since the abundance at each site Sj varies over time, we are able to de¬ 
scribe the temporal changes in species abundance for all spatial locations, 
which is often vital in the context of long-term ecological monitoring studies. 
Another benefit of the design we consider is the potentially sharper estimates 
of the detection probability. Using a single probabilistically coherent model, 
we are able to provide spatial maps that illustrate the changes in abundance 
over time as well as the spatial variation [e.g., see Figures 2 and 3 in the 
supplementary article, Wu et al. (2015)]. More importantly, (3) also sug¬ 
gests how over- and underdispersion can be explicitly accounted for in the 
subsequent model development through the choice of an appropriate count 
model for abundance parameter, Nij. Specifically, under the assumption of 
independence between Nij and Pijk, it follows that 

EiVijk) = E{pijk)E{Nij), 

Vavivijk) = E{pijk)E{N,j) + E{pl„){NaT{Nij) - E{N,j)}. 

Hence, the mean and variance relationship in the data can be addressed 
through that of Nij. For example, for data with over- and underdispersion, 
we can choose a model for Nij such that Var(A^jj) > E[Nij) or Var(A^jj) < 
E{Nij), respectively. As such, our approach addresses over- and underdis¬ 
persed count data through the choice of an appropriate model for abundance 
parameter, Nij. 

For i = 1,2,... ,G and j = 1,2,..., J, the process model we consider for 
the abundance, Nij, is given by 

(4) Nij\Xij , zzj ~ J(Ajj ,1'j), 
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where /(•) is used to generically denote an appropriate count distribution 
with intensity parameter Ajj and primary sampling period-varying disper¬ 
sion parameters vj. There are many possible choices for the distribution 
function /(•) in the process model (4), including the Pois, NB, and CMP, 
among others. We focus on the case where /(•) is chosen to be the CMP 
distribution, resulting in a flexible Bin-CMP mixture model that allows for 
equi-, over-, and/or underdispersion. Alternatively, if /(•) is chosen to be 
the NB distribution, the resulting Bin-NB mixture model provides a suit¬ 
able candidate for modeling overdispersed data. Finally, it is important to 
note that, although we focus on the CMP distribution, in our framework, 
/(•) can be chosen to be any valid count distribution. 

Specification of the parameter model is usually problem-specihc and often 
depends on the research questions under consideration. In long-term ecolog¬ 
ical monitoring studies, it is often of interest to understand which factors 
might be important constituents in the probability of detection, so that an 
efficient sampling protocol can be designed. To achieve this goal, we relate 
the detection probability, pijk, to the covariates Xijkp, ■ ■ ■ ,Xijk,p through a 
logistic link function, that is, 

(5) logit(pijfc) = PiXijk,! -\ -h j3pXijk,p, 

where logit(r) = log{r/(l — r)}, f = 1 , 2 ,..., G, j = 1 , 2 ,..., J, and /c = 1 , 2 , 
Note that (5) allows for an intercept, by setting Xijkp = 1 for all 
i, j, and k. By incorporating covariates into the model, the objective is to 
identify and draw statistical inference on important factors governing the 
probability of detection. Another interest in long-term ecological studies is 
to gain deeper understanding surrounding the intensity Xij, which influences 
species abundance. The second part of the parameter model defines a model 
for the intensity, Xij, as 

log Xij = W- -7 = Wij^i'Ji -h Wij^MlM, 

( 6 ) 

^ = l,...,G;j = l,..., J. 

Here, w^- = {wiji,... ,Wij^My are a set of covariates and 7 = ( 71 ,... ,'yMy 
denotes the associated coefficients. 

3.2. Accounting for spatial dependence. For spatially replicated count 
data, such as those typically encountered in monitoring studies, it is some¬ 
times necessary to explicitly account for spatial dependence in the model 
for intensity. Under this scenario, we can extend ( 6 ) to explicitly incorpo¬ 
rate spatial dependence by adding a spatial component in the model for the 
intensity, that is, 

(7) logAij = wh 7 -L i = l,...,G-,j = 
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or 

log A = w '7 + ($ (g) a') vec(lT-xr), 

wtioro cxj ,..., , cx 5 • • • 5 ^ (^^ii; • • •; ? • • • ? '^G'l; 

w = (wii,...,wij,...,wgi,...,wgj); $ denotes a G x r matrix 
of spatial basis functions $ = ...; 0 ' = ..., is a row vec¬ 

tor denoting the zth row of It-xt is a r x r identity matrix; r is the 
number of basis functions and a ~ A^(0, Xlc). There are several advantages 
to incorporating spatial effects when modeling the intensity function. Most 
importantly, capturing spatial dependence in the intensity function among 
neighboring locations will allow us to borrow strength from correlated ob¬ 
servations, potentially improving parameter estimation, statistical inference, 
and prediction. 

The choice of basis functions is typically problem specific, with advantages 
arising from specihc choices. Popular choices include empirical orthogonal 
functions (EOFs), Fourier basis function, splines, wavelets, bi-square and 
predictive process basis [e.g., see Royle and Wikle (2005), Cressie and Jo- 
hannesson (2008), Cressie and Wikle (2011) and the references therein]. In 
spatial statistical modeling, low-rank representations are often considered 
[Wikle (2010)]. Following Ruppert, Wand and Carroll (2003), we use the 
thin plate spline basis functions, where 

^ = [C{si- Ki)]i<i<j and G(r) = log ||r||, v>l, 

1<1<T 

where ft; (/ = 1 , 2 ,..., r) denote hxed knot points in and u is a smoothness 
parameter [see Holan et al. (2008) for further discussion]. Here, we choose v = 
2 [cf. Ruppert, Wand and Carroll (2003), page 257] and assume cov(aj) = 
where 

Cl = [G(ft; - ft;/)] . 

The selection of knot points can be facilitated through space-filling de¬ 
signs, as implemented in the fields package [Furrer, Nychka and Sain 
(2012)] in R [R Development Core Team (2013)]. The number of knots r 
can be chosen based on computational considerations followed by sensitiv¬ 
ity analysis. Alternatively, the number of knots can be chosen according to 
r = max{20,min(G/4,150)} [Ruppert, Wand and Carroll (2003), page 257]. 
Following Ruppert, Wand and Carroll (2003), we define - 1/2 and 

ct* = Then, for z = 1,2,..., G and j = 1,2,..., J, we can rewrite (7) 

as 

( 8 ) log = WC 7 -k cj}*'a* = , 

where (p*' is the zth row of the matrix and cov(q:}) = a'^.lrxr- Further, 
S'ij = (wC^f) and 7 ^. = ( 71 ,..., 7m, a*i, • • •, a*^)'■ 
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3.3. The likelihood. To account for spatial dependence, we require that 
j = 1,2,..., J in ( 8 ) are in the model with probability one. Since ( 6 ) and 
( 8 ) are essentially of the same form, we will use the former in the subsequent 
discussion. We now derive the likelihood function for the model defined by 
(3), (4), (5), and ( 6 ). Let M. = and denote 

the model structures for the set of covariates x and w and the dispersion 
parameters v = {ui ,..., vj}, respectively. For example, in the case of P = 
6, M = 6, J = 5, = {xi, X 3 } indicates that only xi and X 3 are included in 

the model for detection probability or, equivalently, /32 = /d 4 = /ds = /de = 0 ; 
Ai-f = {wi,W 2 } indicates that only wi and W 2 are included in the model for 
intensity; = {1,2,..., J} indicates that there is only one grouping for 
dispersion parameters, meaning vj = v for j = 1,2,..., J. Under the assump¬ 
tion of conditional independence, the likelihood function for the binomial 
mixture models we propose is given by 


G J rii. 


(9) C{Y\Ai,l3,-f,u) = nnn [yijk\Afij ) \^ij 1 -^ 7 ) T ) 


i=lj=lk=l 


where, generically, [^| 0 ] denotes the conditional distribution of ^ given the 
parameters 6. Integrating out Nij in (9) yields the marginal distribution of 
observing yij as 


P{yij\Ai,p,^,i'j) 


( 10 ) 


E n 


^ij • Vijk 




Uijk^-iAij Uijk) 


Pijk^-Pijkr^ 


X f[Nij\A4~y,y,Aii,,Vj), 


where yfj^^ = maxjyjj}. Consequently, we can derive the joint posterior 
distribution function 7 r(Al,/ 3 , 7 ,i>'|Y) based on (10) as 


7r(Ad,/3,7,i^|Y) 

{ G J 

i=lj=l 

X [/3|M/3][ 71 -^ 7 ] [dW/s] [-^ 7 ] ■ 

Here [0] denotes the joint prior distribution function of the parameters 6. 

Examination of (11) raises several computational concerns. First, the cal¬ 
culation of P[yij\Ai,(3,^,Uj) can be computationally prohibitive, since a 
multiple integral is involved. This computational issue becomes exacerbated 
when the domain of Wj covers a wide range of values and/or if G and J 
are large. In addition to calculating a multiple integral, in the case where 
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/(•) denotes the CMP distribution, evaluating (10) requires computing the 
Z-function, which involves the summation of infinite series. Specifically, for 
the Bin-CMP model, it is worth pointing out that within each MCMC it¬ 
eration, sampling elements in 7 or i/ from their full conditionals requires 
both the computation of the multiple integral and the approximation of the 
Z-function. Therefore, implementation of our proposed model can be com¬ 
putationally intensive in some cases. We resolve these computational issues 
through the use of low level programming in C and parallel computing with 
OpenMP. 

Finally, we assume the following prior distributions for the model param¬ 
eters: /3 Gau(/ 2 ^,I]/ 3 ); 7 r\j For the dispersion parameters, 

we assume Uj ~ Unif(aj, 6 j), j = 1,2,..., J, where aj and bj are chosen ap¬ 
propriately to allow for different levels of dispersion in the data (e.g., for 
overdispersed data, one may set Uj = 0.02 and bj = 1.0). In our case, we 
assign vague prior distributions that are noninformative relative to scale of 
the data. 

4. Automated Bayesian model selection. For the binomial mixture mod¬ 
els we propose, there are several ecological objectives. First, there is a clear 
need to identify important covariates among a set of candidate covariates in 
order to gain an understanding of the factors affecting the detectability for 
a given species of interest. In addition, the selection of influential covariates 
is vital for studying which factors influence species abundance. Last, the 
grouping of dispersion parameters will provide us with further information 
about the level of dispersion associated with the data across different years 
in the study. In such cases, grouping is desired since some years may exhibit 
a similar level of dispersion due to environmental changes or other exogenous 
factors. For example, in our setting, specihc neighborhoods may experience 
slow growth in terms of the number of buildings established and/or certain 
climate conditions may be more (or less) similar from year to year. Thus, it is 
conceivable that some years may experience a similar dispersion parameter. 
As such, we allow for data-driven grouping of the dispersion parameters. To 
achieve these goals, we first discuss variable selection and grouping in the 
context of the models we propose. 

4.1. Bayesian variable selection and grouping. The literature on Bayesian 
variable selection is fairly extensive [e.g., see O’Hara and Sillanpaa (2009), 
Hooten and Hobbs (2015) for a comprehensive review]. Among the many 
available choices, the two most commonly used techniques are stochastic 
search variable selection [George and McCulloch (1993, 1997)] and reversible 
jump MCMC (RJMCMC) [Green (1995)]. For grouping, however, RJMCMG 
is typically considered more appropriate and, thus, we utilize it for both 
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model selection and grouping. Although one could consider model selec¬ 
tion through various model selection criteria [e.g., Deviance Information 
Criterion—Spiegelhalter et al. (2002)], this would be less advantageous when 
the goal is both simultaneous variable selection and grouping. 

For convenience of exposition, we explain our algorithm in the context of 
the Bin-CMP model and note that the migration to other binomial mixture 
models is analogous. The implementation of variable selection for x and 
w involves two types of moves: BIRTH (B) and DEATH (D) defined as 
follows: 

B: propose to add a covariate {xm or Wm) to the current model with prob¬ 
ability 

D: propose to remove a covariate {xm or Wm) from the current model with 
probability p^. 

As an example, we consider a D move for x. In general, only a subset of 
covariates are subject to variable selection, while others are forced to re¬ 
main in the model with probability one. For notational simplification, let 
Ax denote the set of indices corresponding to covariates x that are available 
for variable selection. For example, if there are three covariates xi, X 2 , and 
X 3 available and only xi and x^ are subject to variable selection (i.e., X 2 is 
in the model with the probability 1), then we have A^, = {1,3}. Moreover, 
let jA^,] denote the cardinality of the set A^,. For each covariate in A^,, we 
assume an equal probability of a B or D move, that is, 

pL=p'L = 1/2 for m€Ax. 

Suppose at the current iteration t, the model structure is given by = 
The RJMCMC algorithm for variable selection on x can 
be outlined as follows: 

Step 1: Start with the model structure Ad* = where Ad^ = 

(xii ,...,Xi^} with /3* = 1/3*1 

Step 2: Randomly draw an index from A^, with an equal probability l/|Aa;|. 
Assume is G A^, is chosen: 

- if € Ad^, then propose a D move and obtain \ {x*^} 

and M' = Mi} and /3'= {/3*i,...,/3*, =0,...,/?*^}; 

- otherwise propose a B move and obtain M'^^ = Ad^ U } and 
M' = {M'pMi , Mi] and (3' = {/3*i,..., /3*„„ /3* J. 

Step 3: Adjust the coefficient /3js corresponding to the covariate x*^: 

- if a D move, set /3j<j = 0; 

- otherwise generate /3js ~g('). 
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Step 4:-. Generate u ~ Unif(0,1): 

- if M < min{l, x R}, then set and 

- otherwise and = J\4^. 


Step 5: Repeat. 

In terms of the proposal distribution q{-), we used a Gau(0,C) distribution 
with being a user-defined tuning parameter. Moreover, 


R = 


\ X (liPis), 
Pis 


Pi 


1 


.pi QiPis)'' 


if D move, 


if B move. 


and 


BF(M^,M^) 




We now discuss the grouping algorithm for the dispersion parameters v. 
Assume there are rit different arrangements ri,T 2 ,... for i/ at the tth 
iteration of the MCMG, that is, = {Ti, r 2 ,..., T^, ■.., Tnt}- For each 
grouping T^, m= 1,2,... ,nt, the corresponding elements are subscripts for 
the dispersion parameter group membership. For example, if n* = 1, we have 
Ti = {1,2,..., J}, that is, uj = i^, for j = 1,2,..., J. Similar to the variable 
selection previously described, we allow for two types of moves as follows: 


G: propose to combine two different arrangements into one arrangement 
with Pc, 

S: propose to split the arrangement into two arrangements with probability 
Ps- 


Without loss of generality, assume an equal probability of proposing a G 
or S move, that is, Pc = Ps = 1/2- As an illustration, we describe only 
the S move. Suppose there are nf out of nt arrangements in }A\j that 
have more than one single element. We randomly choose each of these 
nf arrangements with an equal probability. Assume that group Tm is cho¬ 
sen, where m G {!,...,nf} and \Tm\ > 1- Assuming we split Tm into two 
nonempty sets T^i and we denote the resulting model structure as 
= {Ti,T 2 , ... ,Tm^,Tm 2 i ■ ■ ■ iTnt}- The RJMGMC algorithm for group¬ 
ing of V can be outlined as follows: 





BAYESIAN BINOMIAL MIXTURE MODELS 


15 


Step 1: Calculate the probability and as 


P{Mu\M',) 


1 1 1 

2^2(l^-l-b - 1’ 

1 1 

2^ 


[King and Brooks (2002)]. 

Step 2: Let t'm denote the value common to all dispersion parameters in 
Tm and r'mi and r'mj be the values of dispersion parameters in T^i 
and Tmj, respectively. Define a bijective mapping between Vm and 

^mi ? ^m2 


— ^ra “ 1 “ ^ and l^m2 — ^5 

where e ~ h(*). 

Step 3: Generate ^ ~ Unif(0,1): 

- if ^ < min{l,x Rg}, then set and 

- otherwise = Ail, and = Ai^. 


In terms of the proposal distribution /i(-), we used h{r]) = Unif(—r/jT/) where 
rj is chosen through pilot tuning. Moreover, 


BFiAil,Ail) = 


Rg = 


P{M'^ 1 ^mi 5 ^7712 \Y,A4\,j,A4^^,P^) 


P{A\ 


ui ^rn 




1 


P{MAM’,) _ 

P{Ai'^\Ai^.) hie) 


d{u„ 




5. Simulated examples. To evaluate the performance of the binomial 
mixture models we propose, we considered two simulated examples using 
the Bin-CMP model, the difference of which only resides in whether or not a 
spatial component is included in the intensity model. For both simulations, 
we choose G = 131, J = 5, and KT = 3 to be the same as the American 
Robin data presented in Section 6. For both examples, we simulate data as 
yijk\Aij,Pijk ^Bi^iNij,pijk). For the probability of detection, we consider 

logit(pijfc) = PlXijk,! + l32Xijk,2 H-h PpXijk^P, 

where the values for the covariates x are set to be the same as in the Ameri¬ 
can Robin data for i = 1,2,..., G, j = 1,2,..., J, A: = 1,2,..., AT, ^ = 1,2,..., 
P = 4. In addition, we set /3 = (—2.31,—0.4,0.0,—0.4)' with {xi,X 2 ,X 4 } be¬ 
ing important covariates. For the true abundance parameters Nij, we sim¬ 
ulated from Nij ~ CMP(Ajj,i^j), with vi = = 0.15, V 2 = = 0.06 
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and 7 q = (0.31,0.13,0.44,0.16,0.35)', as estimated from the American Robin 
data presented in Section 6. For i = 1,2,..., G and j = 1,2,..., J, the inten¬ 
sity Xij is simulated according to 

SI: log Xij = w'j-f + joj, 

S2: log Xij = w'7 'joj, 

where (f>*' for i = 1,2,..., G and 7o = (701, • • •, 705)^ are determined according 
to the American Robin data with r = 10. In each of the two models, Wj are 
set to be the same as in the American Robin data presented in Section 6. Fur¬ 
ther, we set M = 11 and 7 = (0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.03,0.0)', 
that is, with {rcgjrcio} being important covariates. Particularly, for S2, the 
coefficients of spatial components, ct, are randomly sampled from Unif(0,1) 
to avoid yjjfc being too large. For the two simulations, we apply RJMCMC to 
perform variable selection and grouping. Similar to the analysis presented in 
Section 6, we require o: to be included in the model with probability one for 
S2 and set Uj = 0.02 and bj = 2.0 to allow for both over- and underdispersion. 
In addition, we set fii^ = = 0, = lO^Ip, and = 10^1^. 

Table 1 provides the posterior marginal probabilities for the most probable 
model for x, w, and u in the Bin-CMP models SI and S2. For model SI, 
the most frequent detection probability model was given by {xi,X 2 ,X 4 } and 
appeared with a frequency of 99.73%. The most frequent intensity model 
was defined by {rc6,rcio} and had a frequency of 89.92%. In addition, the 
most frequent grouping for dispersion parameters isA4j^ = {{2,4},{l,3,5}}, 
which appeared with a frequency of 72.51%. In all cases, the RJMCMC 
correctly identified the set of important covariates as well as grouping for 
dispersion parameters with the posterior marginal probability greater than 
or equal to 72.51%. In terms of parameter estimation, in most cases the 
95% credible intervals (CIs), averaged over the different models, contain the 
true values—providing further indication that the correct model is selected 
with high probability. For model S2, the most frequent set of covariates 
for the detection probability model was given by {xi,X 2 -,X 4 \ and appeared 
with a frequency of 99.57%. The most frequent set of covariates 
for the intensity model had a frequency of 93.57%. In addition, the most 
frequent grouping for the dispersion parameters is = {{2,4}, {1,3,5}}, 
which appeared with a frequency of 76.00%. 

In snmmary, the two simulations suggest that we are able to correctly 
identify important covariates and grouping for dispersion parameters with 
high posterior probability. Finally, for the estimation of abundance in the two 
simulations, onr approach performs satisfactorily, as measured by coverage 
of the 95% CIs. In the presence of spatial components, however, we note 
that the model averaged estimates of dispersion parameters can be adversely 
affected by missing data. 
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Table 1 

Posterior marginal probabilities of the most probable model 
for X, w, and v> in the Bin-CMP mixture models SI and S2 
simulated examples (Section 5) using RJMCMC. Note that 
SI contains only the covariates in the intensity model, 
whereas S2 contains both covariates and spatial components 
in the intensity model and that the posterior probability for x 
under both SI and S2 are slightly less than 1.00 and become 
1.00 as a result of rounding 


(a) Variable selection and grouping for SI 


Para- 



Posterior 

meter 

Model 

Frequency probability 

X 

{* 1 , 2 : 2 ,* 4 } 

59,838 

1.00 

w 

{me, wio} 

53,951 

0.90 


{W2, W6,«:io} 

4386 

0.07 

V 

ri={2,4},r2 = {l,3,5} 

43,507 

0.73 


ri = {i,3},r2 = {2,4},r3 = {5} 

7801 

0.13 


ri={l},T2 = {2,4},r3 = {3,5} 

3918 

0.07 

(b) Variable selection and gronping for S2 

Para- 



Posterior 

meter 

Model 

Frequency probability 

X 

{* 1 , 2 : 2 ,* 4 } 

59,741 

1.00 

w 

{ 1 * 6 , 2010 } 

56,139 

0.94 

V 

ri = {2,4},r2 = {i,3,5} 

37,071 

0.76 


ri={3},T2={2,4},r2 = {l,5} 

7573 

0.13 


6. Application: The Baltimore Ecosystem Study. In the urban ecosys¬ 
tems literature, bird communities are often used as surrogates for studying 
urban biodiversity or species responses to urbanization [Shochat, Lerman 
and Fernandez-Juricic (2010), Aronson et al. (2014)]. Within urban areas 
the bird community is shaped by local-scale features such as habitat fea¬ 
tures that vary among neighborhoods, landscape pattern, and socioeconomic 
characteristics of residents that may influence land management decisions 
[Pickett et al. (2012)]. The American Community Survey (ACS) is an ongo¬ 
ing survey that is able to provide timely economic, social, and demographic 
information on small geographies such as census tracts. Thus, to examine 
the effects of certain demographic characteristics on abundance, we con¬ 
sider several ACS variables. Additionally, environmental features of differ¬ 
ent neighborhoods can be described by many factors, such as vegetation 
diversity and are, therefore, also considered in our analysis. 
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Substantial research has been undertaken to investigate how socioeco¬ 
nomic status and environmental variables influence the abundance and di¬ 
versity of various avian species [see Loss, Ruiz and Brawn (2009), Small- 
bone, Luck and Wassens (2011), Denison (2010) and the references therein]. 
Using socioeconomic variables from the decennial census in 2000 associated 
with each census tract block groups as covariates, Denison (2010) considered 
a simple NB regression with no spatial components under the frequentist 
paradigm to estimate the relative abundance for European starling in the 
City of Baltimore, Maryland using a portion of data collected from 2005 to 
2007. In contrast, we consider American Robin data from the BES collected 
from 2005 to 2009 and apply various Bin-CMP models in order to select im¬ 
portant covariates for estimating the detection probability and abundance 
of the American Robin, as well as to identify the grouping of dispersion 
parameters. Due to missing values, the data we consider has an unbalanced 
structure. In particular, the percentage of secondary sampling occasions with 
at least one missing observation for each of five primary sampling occasions is 
6.87%, 6.87%, 3.05%, 77.1%, and 50.38%, respectively. Moreover, the overall 
percentage of missing observations in the American Robin data set is 9.62%. 

Eor the American Robin data, a total of 131 bird survey points were 
visited during three secondary daily surveys within each of the five primary 
sampling occasions from 2005 to 2009. With three covariates available, we 
considered a full model for the detection probability as 

(12) logit(pijfc) = /5i + + /^sairtempjj-fc -P /34cloudcoverjjfc, 

for z = 1,..., 131, j = 1,..., 5, and A; = 1,..., njj < A = 3. Regarding the 
covariates in (12), time, airtemp, and cloudcover correspond to the start 
time, air temperature, and cloud cover (i.e., the fraction of the sky obscured 
by clouds) recorded on each visit to the bird survey points, respectively. 

In terms of full models for the intensity, we considered the following three 
models: 

Ml: log Xij = w'7 -P 0-'q: -P yoj, 

M2: logAij = w'7-P7oj, 

M3: log Xij = 4>*'a + -/qj , 

where, for j = 1,... ,J, -joj is a year-specific intercept and (f)*' is the zth row 
of the matrix as discussed in Section 3. Moreover, the covariates in the 
intensity model are given by w( = (uftreej, ufbldgj, ufmgrasSj, bld200mj, 
for200mj, veg200mj, Africanj, bachelorj, fmkdsj, pubassitj, houseyrj. 
These covariates are specific to each survey location and do not vary with 
primary sampling occasions. Among these environmental variables, uf tree, 
ufbldg, and ufmgrass are the UFORE plots variables that indicate tree 
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cover, ground cover by buildings and maintained grass, respectively. Fur¬ 
ther, bld200m, f or200m, and veg200m are variables that measure tree cover, 
other vegetation cover, and cover by buildings in the 200 meter radius plot, 
respectively [see Figure 1 in the supplemental article, Wu et al. (2015)]. 
For the ACS variables specific to each census tract block group, African 
is the percentage of African American residents; bachelor is the percent¬ 
age of population with Bachelor’s degree or higher; fmkds is the percent¬ 
age of housing units occupied by female householder and children under 
18 years; pubassit is the percentage of households on government public 
income assistance; hourseyr is the median year that a housing unit was 
built. We used the five-year period estimates from 2005 to 2009 for these 
ACS variables, which can be obtained at the U.S. Census Bureau website 
(http://www.census.gov/acs/www/). Our specific choice of ACS variables 
was facilitated by a social areas analysis approach [Denison (2010), Mal¬ 
oney and Auffrey (2013), Muller et al. (2013)]. Note that we standardize 
the covariates in (12) and in the intensity model for numerical stability. 
Further, based on exploratory analysis involving various collinearity diag¬ 
nostics (e.g., condition number, etc.) of the site covariates (not shown) and 
subject matter knowledge, we expect any effects of collinearity between the 
site covariates to have a minimal affect on the variable selection algorithm. 
Finally, for model Ml, we orthogonalize the matrix of spatial basis function 
with respect to covariates, to alleviate potential confounding with the co¬ 
variate effects [Hodges and Reich (2010)]. As a result, </>*' is the ith row of 
the matrix of after the orthogonalization. 

It is worth pointing out that the choice of models above depends on the 
goal of the ecological study. For example, M3 can be used if no covariates 
are available for modeling the intensity. For other cases where covariates are 
available, but there is no spatial dependence (or the spatial dependence is 
negligible after accounting for covariates), model M2 can be utilized. Given 
both covariates are available and spatial dependence is present, Ml repre¬ 
sents a potential model. 

When implementing the RJMCMC algorithm, we require the “intercept” 
term j3\ in (12) and 7 q, in the model for intensity, to be included with prob¬ 
ability one. In addition, in the presence of spatial components, we require a 
to be in the model for the intensity with probability one. For the choice of 
knot points, when using low-rank thin plate basis functions, we considered a 
sensitivity analysis to choose the number of knots and a space-filling design 
for placement. Specifically, for three different choices of the number of knot 
points, T = 10, 15, and 32 in Ml, similar results are obtained in terms of 
abundance estimation, although parameter estimation becomes more diffi¬ 
cult as r gets large. Equally important, the results of a sensitivity analysis 
indicate that the variable selection and grouping for the dispersion param¬ 
eters seem robust to a different number of knot points. Hence, we choose 
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r = 10 for both Ml and M3. We used a Metropolis-Hastings within Gibbs 
sampler consisting of a total of 120,000 MCMC iterations, with the first 
60,000 discarded as burn-in. Our inference is based on every third sample 
after burn-in, which results in a total of 20,000 samples used. 

In terms of posterior marginal probability, the model having time and 
cloudcover has the highest probability of being selected in the model for de¬ 
tectability. Similarly, for the intensity model, uf bldg, veg200m, and pubassit 
are selected with higher probability relative to other covariates. However, 
the grouping of dispersion parameters varies across models depending on 
whether spatial components are included. This is not unexpected, as there 
is a trade-off between the dispersion parameter and inclusion of spatial com¬ 
ponents. The three models we considered all produce similar results in terms 
of the selection of important covariates and abundance estimates (results not 
shown). However, since the goal of our analysis is to identify and draw in¬ 
ference on important covariates relating to detectability and abundance, we 
present results from the more parsimonious model M2. From Table 2, it can 
be seen that time and cloudcover are identihed as important predictors for 
detectability of American Robin. For the covariates in the intensity model, 
ufbldg, veg200m, and pubassit are selected as the important factors in all 
cases. For the dispersion parameters, the results suggest the most probable 
model has the grouping Ti = {2,4}, T 2 = {1 ,3,5} (with posterior probability 
0.6496), indicating that the data in 2005, 2007, and 2009 exhibit a simi¬ 
lar amount of dispersion, whereas the data for 2006 and 2008 show similar 
amounts of dispersion. 

Last, we consider the posterior mode model (i.e., the model with the 
highest posterior probability) for the Bin-CMP mixture model M2 in order 
to draw inference about how the different covariates affect high detectabil¬ 
ity and abundance of the American Robin within the study domain. We 
conclude that an important covariate is a positively (or negatively) signif¬ 
icant factor if the lower (or upper) end of 95% CIs is greater (or smaller) 
than 0, respectively. For the posterior mode model, we include only the in¬ 
tercept, time, and cloudcover in (12), whereas for the covariates in the 
intensity model, only ufbldg, veg200m, and pubassit are included. For 
the dispersion parameters, we consider the case where 1^2 = ^4 = ^ 2 A and 

= 1^3 = = 1^135- Table 2 presents the posterior summary statistics and 

Gelman-Rubin diagnostics [Brooks and Gelman (1998)] for model parame¬ 
ters. It is shown that in all cases R is close to 1, indicating convergence has 
been reached. Moreover, time is negatively correlated with the detectabil¬ 
ity of the American Robin, that is, the earlier the survey is conducted, the 
more likely it is that we can detect American Robin. In terms of the intensity, 
ufbldg is negatively related to the abundance of American Robin, whereas 
veg200m and pubassit are positively related. As a result, for bird survey 
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Table 2 

Posterior probabilities of the most probable model for M2 and 
the posterior summary statisties in the Bin-CMP model 
assuming the posterior mode model for M2. Note that M2 
only contains covariates in the intensity model, and R refers 
to the Gelman-Rubin diagnostics 


(a) Variable selection and grouping 


Variable Model 

Posterior 
Frequency probability 

X 

{cloudcover} 

31,992 

0.53 


{time, cloudcover} 

27,587 

0.46 

w 

{veg200m, pubassit} 

51,343 

0.86 


(ufbldg, veg200m, pubassit} 

7234 

0.12 

V 

ri = {2,4},r2 = {1,3,5} 

38,973 

0.65 


ri = {2},r2 = {i,3,4,5} 

7445 

0.12 


Ti = {2},r2 = {4},T2 = {l,3,5} 

3745 

0.06 


(b) Parameter estimation 

Parameter 

/^post 

^ post 

Qo.025 

Qo.075 

R 

intercept 

-2.31 

0.07 

-2.45 

-2.17 

1.00 

time 

-0.10 

0.03 

-0.15 

-0.04 

1.00 

cloudcover 

-0.04 

0.03 

-0.09 

0.01 

1.00 

ufbldg 

-0.02 

0.01 

-0.03 

-0.01 

1.00 

veg200m 

0.06 

0.01 

0.05 

0.09 

1.00 

pubassit 

0.02 

0.01 

0.01 

0.04 

1.00 

701 

0.35 

0.07 

0.23 

0.51 

1.01 

702 

0.16 

0.05 

0.07 

0.26 

1.01 

703 

0.48 

0.08 

0.33 

0.67 

1.01 

704 

0.14 

0.05 

0.05 

0.24 

1.01 

705 

0.38 

0.07 

0.25 

0.55 

1.01 

1^24 

0.08 

0.02 

0.05 

0.11 

1.01 

1^135 

0.17 

0.03 

0.12 

0.23 

1.01 


points nearby more buildings, the abundance of American Robin is lower; 
while for survey points with a higher percentage of vegetation and residents 
of lower socio-economic status, the abundance of American Robin is higher. 
As an example. Figure 2 provides a spatial map for the posterior mean and 
standard deviation of the abundance estimate (from M2) for 2009, whereas 
Figures 2 and 3 of the supplemental article [Wu et ah (2015)] illustrate how 
the abundance estimates and their standard errors change over the duration 
of the period studied (2005-2009). Last, our results suggest that the Amer- 
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Fig. 2. Plots of posterior mean and standard deviation of abundance estimates for 2009 
in the Bin-CMP model assuming the posterior mode model for M2. Note that M2 only 
contains covariates in the intensity model, (a) Posterior mean, (b) posterior sd. 
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ican Robin are overdispersed within the study domain over all of the years 
considered. 

7. Discussion. Motivated by the American Robin data from the BES, 
we developed a class of Bayesian hierarchical binomial mixture models that 
allow for automated variable selection and grouping in the presence of un¬ 
balanced nested design. In addition, we demonstrate that over- and under¬ 
dispersion in the data can be accounted for by specifying an appropriate 
model for the abundance parameter, namely, a Bin-CMP model. More im¬ 
portantly, we allow for large-scale spatial dependence to be accounted for by 
adding a spatial component to the intensity model (i.e., through a spatial 
basis function expansion). Under the binomial mixture modeling framework, 
the use of a low-rank spatial representation proves to be a computationally 
advantageous approach to building in spatial dependence. 

Although we have presented a model (M2) that accounts for covariate 
information, spatial maps that predict abundance at unobserved locations 
could be obtained using model M3 and thereby take advantage of the spline 
formulation. In contrast, both models Ml and M2 would require imputation 
of covariates at unobserved locations (i.e., additional data models) to predict 
abundance at unobserved locations. Consequently, since our goal is primarily 
inferential, this direction has not been pursued here. 

The class of binomial mixture models we consider assume population 
closure within each primary sampling period. Such an assumption is often 
justihed based on biological and/or ecological considerations, when the pri¬ 
mary sampling period covers a relatively short time frame. In our case, the 
justihcation of the closed population assumption is based on ecological con¬ 
siderations. However, it may also be possible to extend our model to verify 
the assumption of population closure following the framework of Bail and 
Madsen (2011) by decomposing the true abundance into the sum of two 
independent components, that is, the total number of survivors from the 
previous sampling period (by introducing a survival rate parameter in the 
model) and new additions prior to the current sampling period (by intro¬ 
ducing a birth parameter in the model). This is a subject of future research. 

Although the binomial mixture models we propose can accommodate un¬ 
balanced data structures, the amount of missing data can impact model 
selection and parameter estimation. As discussed in the second simulated 
example, the model averaged estimates for dispersion parameters are posi¬ 
tively biased when the simulated data exhibit the same missing pattern as 
the American Robin data and spatial components are included to account 
for spatial dependence in the intensity model. Nevertheless, we note that 
grouping of dispersion parameters leads to a “borrowing of strength,” since 
data collected over different years are pooled together if the corresponding 
dispersion parameters fall into the same group. In other words, this pooling 
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of data helps mitigate the negative impacts of missing values. In general, a 
comprehensive assessment of the effect of missing data is problem specific 
and depends on both the pattern of missingness and the underlying spa¬ 
tial dependence (e.g., the effective sample size). In practice, we advocate 
evaluating these effects through simulated data examples, similar to those 
conducted here. 

It is important to note that all of the models we considered for the Amer¬ 
ican Robin data provide similar results regarding the identification of im¬ 
portant covariates for detectability and intensity, as well as the grouping 
of dispersion parameters. First, time, and cloudcover are identihed to be 
important covariates for high detectability of the American Robin, with the 
former being negatively related to observing the American Robin. However, 
one should be careful when interpretating cloudcover due to the difficulty 
in estimating it objectively [Vignola, Michalsky and Stoffel (2012)]. On the 
other hand, ufbldg, veg200m, and pubassit are found to be important 
predictors for abundance of the American Robin. In terms of dispersion, the 
American Robin data demonstrates overdisperion. Importantly, the class of 
binomial mixture models we propose is of independent interest and when 
coupled with the CMP distribution can be used in cases where the type of 
dispersion (i.e., over- and underdispersion) varies over time. In this sense, 
the Bin-CMP mixture model is extremely versatile, as it can be used for 
modeling equi-, over-, and underdispersed data (e.g., for modeling abun¬ 
dance of less prevalent species, such as the Eastern Wood Pewee or Wood 
Thrush in the BES). 
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SUPPLEMENTARY MATERIAL 

Supplement to “Bayesian binomial mixture models for estimating abun¬ 
dance in ecological monitoring studies” (DOL 10.1214/14-AOAS801SUPP; 
.pdf). The supplementary material contains the MCMC sampling algorithm, 
details regarding computation times for the models implemented, and addi¬ 
tional figures. 
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